DOCUMENT RESUME

ED 093 926                                              TM 003 744

AUTHOR        Lai, Morris K.
TITLE         The Case Against Tests of Statistical Significance.
              Teacher Education Division Publication Series, Report
              A73-20.
INSTITUTION   Far West Lab. for Educational Research and
              Development, San Francisco, Calif.
PUB DATE      [73]
NOTE          10p.; Paper presented at the Annual Meeting of the
              California Educational Research Association (Los
              Angeles, California, 1973)

EDRS PRICE    MF-$0.75 HC-$1.50 PLUS POSTAGE
DESCRIPTORS   *Analysis of Variance; *Hypothesis Testing;
              *Problems; Statistical Analysis; *Tests of
              Significance



ABSTRACT

The purposes of this paper are to: (1) describe some
of the serious shortcomings in the current use of tests of
statistical significance, (2) discuss how misuses are perpetuated in
some widely used references, and (3) present an alternative
significance testing model that overcomes some, but not all, of the
shortcomings of the currently used method. For the purposes of this
paper, the discussion is restricted to fixed effects analysis of
variance (ANOVA) (including t-tests), which is perhaps the most
pervasive of the data analyses used by educational researchers.
(Author)







TEACHER EDUCATION DIVISION
PUBLICATION SERIES

THE CASE AGAINST TESTS OF STATISTICAL SIGNIFICANCE

Morris K. Lai

Paper presented at the annual meeting of the California
Educational Research Association, Los Angeles, 1973

U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.

REPORT A73-20

FAR WEST LABORATORY FOR EDUCATIONAL RESEARCH AND DEVELOPMENT

1855 Folsom Street, San Francisco, California 94103, (415) 565-3000







The Case Against Tests of Statistical Significance 

Morris Lai 

Far West Laboratory for Educational Research and Development

The purpose of this paper is to (1) describe some of the serious
shortcomings in the current use of tests of statistical significance, (2) discuss
how misuses are perpetuated in some widely used references, and (3) present
an alternative significance testing model that overcomes some, but not all,
of the shortcomings of the currently used method.

Defining "testing statistical significance" 

For the purposes of this paper, the discussion will be restricted to
fixed effects analysis of variance (ANOVA) (including t-tests), which is
perhaps the most pervasive of the data analyses used by educational researchers.
A test of statistical significance is basically a process whereby two or more
groups are compared, and for whatever difference is found, a "p value" is
calculated, which is the probability that a difference that large or larger
would have arisen in a sample had the groups been truly equivalent as
populations.




[Figure: Distribution of the test statistic when the groups are equivalent
in the population. The horizontal axis is the test statistic, with the
observed value of the test statistic marked on it; the shaded area equals
the p value for the observed test statistic.]

For observed test statistics that are sufficiently large, the p values 
are correspondingly small (i.e., statistically significant). 
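
As a rough illustration of this definition, the following Python sketch computes a p value for a two-group comparison by exact permutation: it counts how often a relabeling of the pooled scores yields a mean difference at least as large as the observed one. This is only one way to make the definition concrete and is not the F or t computation the paper has in mind; the function name `permutation_p_value` and the scores are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Share of all relabelings whose absolute mean difference is at least
    as large as the observed one (exact two-group permutation test)."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    n_splits = 0
    # Every way of choosing which pooled scores are labeled "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical scores for two small groups.
treatment = [12, 15, 14, 16, 13]
control = [10, 9, 11, 12, 8]
print(permutation_p_value(treatment, control))
```

With groups this small the enumeration is cheap (252 relabelings); for larger samples one would typically sample relabelings at random instead.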

Random assignment 

Such a model requires, to start off with, random sampling. If assignment
to treatment is not random, then a test of significance is inappropriate
(Morrison & Henkel, 1969).

Type I error rate

Nearly every textbook on inferential statistics discusses the concept
of Type I and Type II errors. Despite warnings from Horst (1966), Skipper
et al. (1967), and Winer (1971) about the inappropriateness of endowing
Type I error rates of .05 and .01 with some sort of sacredness, the
prevalence of such sacredness is well known (e.g., the APA Publication Manual
advocates one asterisk for p < .05 and two asterisks for p < .01).

Practical or educational significance 

It is popular today to exhibit some enlightenment by emphasizing that
statistical significance does not necessarily imply practical or educational
significance. Yet in Guilford's (1956) widely used textbook we find the
following quote (p. 275):

The F ratio for machines is significant beyond the .01 level,
leaving us with considerable confidence that the machine
differences, as such, have a real bearing upon the difficulty
of the task.

Such a significant F could have resulted where the differences were trivial
in the practical sense. Another misuse of p levels occurs when researchers
use significance levels to compare results from several studies (e.g.,
Eysenck, 1960; Bracht, 1970).
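
The point that a significant result can accompany a trivial difference can be sketched numerically. The example below holds a hypothetical difference fixed at 0.02 standard deviations and varies only the sample size. As a simplifying assumption it uses a two-sample z test with a known common standard deviation rather than the F or t test discussed in the paper; `two_sided_p` and all of the numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p value for a two-sample z test with known common SD."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))


# The same tiny difference (0.02 SD) at increasing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(n, two_sided_p(mean_diff=0.02, sd=1.0, n_per_group=n))
```

Under these assumptions the p value falls steadily as n grows, even though the size of the difference never changes.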

Type II error rates, power, and accepting null hypotheses

Type II error rates and power calculations are less familiar to
researchers. Anyone who accepts a null hypothesis, without knowing the
power of the statistical test, is liable to have a huge Type II error rate.
Yet Popham (1967) in his text writes "...hypothesis under consideration is
either accepted or rejected." Glass and Stanley (1970) also mislead their
readers by advocating, without consideration of power, the acceptance of
the null hypothesis when statistical significance is not attained. Other
writers who advocate (inappropriately) the accepting of null hypotheses
if a significant statistic is not observed include Walker and Lev (1953),
Guilford (1956), and Kirk (1968).

It is possible to prove algebraically that for a predetermined level
of significance, there exist normal distributions such that the F or t
statistic will not be significant, but the size of the effects will be
larger than any predetermined number. As such, a researcher who accepts
a null hypothesis without knowing the power of the test may be calling a
very large difference a "zero difference." McNemar's (1962) suggestion of
using three regions (acceptance, suspended judgment, and rejection),
depending on the p level, does not overcome this objection.
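
To make the risk concrete, the sketch below computes the power of a two-sided, two-sample test and the corresponding Type II error rate. It assumes a z test with a known common standard deviation as a stand-in for the t test, and the effect size (0.8 SD), the sample size (5 per group), and the name `z_test_power` are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(true_diff, sd, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z test with known common SD."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # True difference expressed in standard-error units.
    shift = abs(true_diff) / (sd * sqrt(2.0 / n_per_group))
    upper_tail = 1.0 - std_normal.cdf(critical - shift)
    lower_tail = std_normal.cdf(-critical - shift)
    return upper_tail + lower_tail


power = z_test_power(true_diff=0.8, sd=1.0, n_per_group=5)
print("power:", power)
print("Type II error rate:", 1.0 - power)
```

In this setup a researcher who "accepts" the null hypothesis after a nonsignificant result would be wrong most of the time when the true difference is 0.8 SD, because the test detects such a difference in only about a quarter of samples.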



Sample size 

Another problem that I will discuss is determining sample size. Any
scientist appreciates the fact that the larger the sample, the more
information one has. Aside from cost-benefit considerations and manageability,
it is illogical to say that a smaller sample is more desirable than a larger
one; for example, Hays (1963) clearly states that for precision, the bigger
the sample size the better. Yet on the next page (p. 3^4) he suggests that
the researcher ask the following question: "Is the sample size large
enough to give confidence that the big associations will indeed show up,
while being small enough so that trivial associations will be excluded
from significance?" If a procedure is such that it results in worry about
whether a sample size is small enough, then surely something is seriously
wrong with that procedure.

Appropriate null hypotheses

The last problem I will discuss deals with null hypotheses. The
unquestioning acceptance of an always zero difference null hypothesis has
been criticized by several writers (e.g., Grant, 1962; Kerlinger, 1964;
Cohen, 1969). Dixon and Massey (1969) and Pena (1970) have both presented
a procedure for testing non-zero null hypotheses for the two sample case.
The incorporation of a predetermined minimum practical difference into the
null hypothesis (now non-zero) ties in the statistical and practical
significance. By means of this rarely used procedure, a researcher can
state more appropriate null hypotheses. Instead of asking if there is a
difference at all, researchers usually should be asking whether or not
there is an educational or practical difference. Instead of asking whether
a Datsun gets better mileage than a Cadillac, we should be asking how many
more miles per gallon a Datsun gets and whether this difference is of practical
importance. Likewise, instead of asking whether one group has scored higher
than another, we should be asking how much higher one group has scored
than another and if this difference is of practical or educational importance.
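
One way to picture a non-zero null hypothesis in the two sample case is sketched below. It is not the Dixon and Massey or Pena procedure itself, whose details are not reproduced in this paper; it is a simplified one-sided z test with a known common standard deviation, in which the null hypothesis states that the true difference is no larger than a minimum practical difference. The names `nonzero_null_p` and `min_practical_diff` and all of the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def nonzero_null_p(observed_diff, min_practical_diff, sd, n_per_group):
    """One-sided p value for H0: mu_1 - mu_2 <= min_practical_diff,
    assuming a known common SD and equal group sizes."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = (observed_diff - min_practical_diff) / standard_error
    return 1.0 - NormalDist().cdf(z)


# A trivial observed difference (0.02 SD) with a very large sample.
# Against a zero-difference null it looks "significant" ...
print(nonzero_null_p(0.02, min_practical_diff=0.0, sd=1.0, n_per_group=100_000))
# ... but not against a null that builds in a 0.2 SD practical threshold.
print(nonzero_null_p(0.02, min_practical_diff=0.2, sd=1.0, n_per_group=100_000))
```

Setting `min_practical_diff` to zero recovers the usual zero difference null hypothesis, which is why the same function can show both cases.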

Summary

In summary, well respected writers have suggested that researchers do
the following: (1) test null hypotheses that are usually inappropriate, (2)
accept these null hypotheses without regard to power (and possibly have
huge Type II errors), (3) use arbitrary (sacred) rejection probability
levels of .05 and .01, and (4) be careful in not getting too large a sample
size.

These misleading (inappropriate) recommendations are interrelated in
that their disappearance would be highly correlated with the elimination
of tests of significance. But change comes slowly, and I propose an
analysis of variance methodology that gets rid of (1) and (4) (inappropriate
null hypotheses and the illogical concept of a sample being too large).

Noncentral analysis of variance 

The method can perhaps be best understood in terms of its being an
extension of the two sample case, which has been described by Dixon and
Massey (1969). The analog to the minimum practical difference is $\delta$, the
noncentrality parameter of the noncentral F distribution. Just as the
ordinary F distribution is associated with a zero difference null hypothesis,
the noncentral F distribution is associated with a non-zero null hypothesis.
Minimum practical differences are now stated in terms of average differences
between groups.



The derivation of the noncentral ANOVA model is complex and will be
presented in more detail in another paper. The use, however, is rather
simple. Having determined the minimum practical difference, a researcher
need only use a table to determine the noncentrality parameter. He then
rejects the (nonzero) null hypothesis if his observed F statistic exceeds
$F_{\nu_1, \nu_2, \delta}(1 - \alpha)$, where $\nu_1$ and $\nu_2$ are the usual
parameters that determine the central F distribution, $\delta$ is the
noncentrality parameter, and $\alpha$ is the Type I error rate chosen.

Such a procedure results in an appropriate adjustment for sample size.
Thus, statistical significance is not attainable by merely increasing the
sample size. The illogical concept of too large a sample no longer exists.
At the same time, appropriate null hypotheses are being tested.
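
The adjustment for sample size can be sketched with a small simulation. Instead of the table lookup described above, the code below approximates the upper critical value of the F statistic by Monte Carlo, once under a zero difference null and once under a non-zero null in which the group means sit at a hypothetical minimum practical pattern. This is an assumption-laden stand-in, not the paper's derivation: the function names, the pattern `(0.0, 0.1, 0.2)`, the unit standard deviation, and the number of simulations are all illustrative, and the noncentrality parameter is never computed explicitly.

```python
import random


def f_statistic(groups):
    """One-way fixed effects ANOVA F ratio for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def simulated_critical_f(null_means, sd, n_per_group, alpha=0.05,
                         n_sims=1000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile of F when the
    population means equal `null_means` (normal data, common SD)."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        groups = [[rng.gauss(m, sd) for _ in range(n_per_group)]
                  for m in null_means]
        sims.append(f_statistic(groups))
    sims.sort()
    index = min(n_sims - 1, round((1.0 - alpha) * n_sims) - 1)
    return sims[index]


# Critical values under a zero null and a hypothetical non-zero null.
for n in (20, 200):
    central = simulated_critical_f((0.0, 0.0, 0.0), 1.0, n)
    noncentral = simulated_critical_f((0.0, 0.1, 0.2), 1.0, n)
    print(n, central, noncentral)
```

Under the zero difference null the critical value stays roughly the same as n grows, whereas under the non-zero null it rises with n, so a larger sample alone does not push a trivial difference past the rejection threshold.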



Bibliography

American Psychological Association. Publication manual. Washington, D.C.:
APA, 1967.

Bracht, G. Experimental factors related to aptitude-treatment interactions.
Review of Educational Research, 1970, 40(5), 627-645.

Cohen, J. Statistical power analysis for the behavioral sciences. New York:
Academic Press, 1969.

Dixon, W., and Massey, F., Jr. Introduction to statistical analysis. New
York: McGraw-Hill, 1969.

Eysenck, H. J. The concept of statistical significance and the controversy
about one-tailed tests. Psychological Review, 1960, 67(4), 269-271.

Glass, G., and Stanley, J. C. Statistical methods in education and
psychology. Englewood Cliffs, N.J.: Prentice-Hall, 1970.

Grant, D. A. Testing the null hypothesis and the strategy and tactics of
investigating theoretical models. Psychological Review, 1962, 69,
54-61.

Guilford, J. P. Fundamental statistics in psychology and education.
New York: McGraw-Hill, 1956.

Hays, W. L. Statistics. New York: Holt, Rinehart & Winston, 1963.

Horst, P. Psychological measurement and prediction. Belmont, California:
Wadsworth, 1966.

Kerlinger, F. Foundations of behavioral research. New York: Holt,
Rinehart & Winston, 1964.

Kirk, R. Experimental design: procedures for the behavioral sciences.
Belmont, California: Brooks/Cole, 1968.

McNemar, Q. Psychological statistics. New York: John Wiley & Sons, 1962.

Morrison, D. E., and Henkel, R. E. Significance tests reconsidered. The
American Sociologist, 1969, 4, 131-140.

Morrison, D. E., and Henkel, R. E. The significance test controversy.
Chicago: Aldine, 1970.

Pena, D. A significant difference of opinion with the Coats position.
Educational Researcher, 1970, 21, 9-10.

Popham, W. J. Educational statistics. New York: Harper & Row, 1967.

Rosenkrantz, R. The significance test controversy. Educational Researcher,
1972, 1(12), 10-14.

Skipper, J. K., Guenther, A. C., and Nass, G. The sacredness of .05: a
note concerning the uses of statistical levels of significance in
social science. The American Sociologist, 1967, 2, 16-18.

Walker, H. M., & Lev, J. Statistical inference. New York: Henry Holt
& Company, 1953.

Winer, B. J. Statistical principles in experimental design. New York:
McGraw-Hill, 1971.



